Anti-Serendipity: Finding Useless Documents and Similar Documents

Authors

  • James W. Cooper
  • John M. Prager
Abstract

The problem of finding your way through a relatively unknown collection of digital documents can be daunting. Such collections sometimes have few categories and little hierarchy, or they have so much hierarchy that valuable relations between documents can easily become obscured. We describe here how our work in the area of term recognition and sentence-based summarization can be used to filter the document lists that we return from searches. We can thus remove or downgrade the ranking of some documents that have limited utility even though they may match many of the search terms fairly accurately. We also describe how we can use this same system to find documents that are closely related to a document of interest, thus continuing our work to provide tools for query-free searching.

Similar resources

serendiPDF with Searchable Math-fields in PDF Documents

serendiPDF is an attempt to make it easier to find the correct way to express complicated mathematics, especially aligned environments, using LaTeX. This is achieved by storing a copy of the LaTeX source for a mathematical environment inside the generated PDF output, in a way that allows it to be easily accessed and copied into the source for other documents. In this way, the full power of “ser...

A Method for Finding Similar Documents Relying on Adding Repetition of Symbols in Length Based Filtering

A basic topic in mining massive datasets is finding similar items, for example finding similar documents. Many methods exist for this task, among them the shingling method and length-based filtering. In the shingling method, substrings (named symbols) are selected from each document and placed in one set. For finding similar documents, th...

Comparison of Strategic Plans of Universities and Institutes of Higher Education with a Quantitative Approach

Strategic planning in Iranian universities and institutes of higher education is generally prepared using strategic planning models introduced by experts and other universities. These programs are published in the form of university strategic planning documents. These documents have features that can be similar to or different from the programming templates used. Existence of the similar...

Finding Meaningful Regions Containing Given Keywords from Large Text Collections

When we search a large text collection for documents we want, we specify some keywords and obtain documents containing the keywords. Because the search result contains many documents, it is important to rank them. Though some methods have been proposed for ranking, they do not consider the positions of the keywords. As a result, such documents sometimes may be useless because the ke...

Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts and non-useless parts in each document without any knowledge on the documents. It is based on a simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of length n and frequency a, we introduce a new statist...

Publication date: 2000